Remote data multicasting and remote direct memory access over optical fabrics

ABSTRACT

Today's communications require an effective yet scalable way of interconnecting data centers and warehouse scale computers (WSCs), whilst operators must provide a significant portion of data center and WSC applications free of charge to users and consumers. At present, data center operators face the requirement to meet exponentially increasing demand for bandwidth without dramatically increasing the cost and power of the infrastructure employed to satisfy this demand. Simultaneously, consumer expectations of download/upload speeds and latency in accessing content provide additional pressure. Accordingly, the inventors provide a number of optical switching fabrics which reduce the latency and microprocessor loading arising from the prior art Internet Protocol multicasting techniques.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority as a divisional application of U.S. patent application Ser. No. 16/928,370 filed Jul. 14, 2020, which itself claims the benefit of priority from U.S. Provisional Patent Application 62/873,996 filed Jul. 15, 2019, the entire contents of which are incorporated herein by reference.

FIELD OF THE INVENTION

This invention relates to remote data multicasting and remote direct memory access and more particularly to exploiting optical multicasting, optical cross-connects, and optical fabrics within datacenters and data interconnection networks for low latency communications.

BACKGROUND OF THE INVENTION

Data centers are facilities that store and distribute the data on the Internet. With an estimated 14 trillion web pages on over 750 million websites, data centers contain a lot of data. Further, with almost three billion Internet users accessing these websites, including a growing amount of high bandwidth video, there is a massive amount of data being uploaded and downloaded every second on the Internet. At present the compound annual growth rate (CAGR) for global IP traffic between users is between 40% and 50%. In 2015 user traffic averaged approximately 60 petabytes per month (60×10¹⁵ bytes per month) and is projected to grow to approximately 160 petabytes per month in 2020. In 2020 this represents approximately 185 Gb/s of user traffic, or external traffic, over the Internet.

However, with the ratio of intra-data center traffic to external traffic over the Internet for a single simple request being reported as high as 1000:1, this represents approximately 185 Tb/s of internal traffic within the data centers. Further, in many instances there is a requirement for significant replication of content requests from users, e.g. for streaming audiovisual content, leading to multicasting rather than point-to-point (P2P) data communications within the data center. Accordingly, it is evident that a significant portion of communications within a data center relate to multicasting IP data within the data center and to the external users. Even worse, peak demand will be considerably higher, with projections of over 600 million users streaming Internet high-definition video simultaneously at these times.

A data center is filled with tall racks of electronics surrounded by cable racks, where data is typically stored on big, fast hard drives. Servers are computers that take requests and move the data using fast switches to access the right hard drives, either writing or reading the data to or from those drives. In mid-2013 Microsoft stated that it alone had over 1 million servers. Connected to these servers are routers that connect the servers to the Internet and therein to the user and/or other data centers. At the same time as requiring an effective yet scalable way of interconnecting data centers and warehouse scale computers (WSCs), both internally and to each other, operators must provide a significant portion of data center and WSC applications free of charge to users and consumers, e.g. Internet browsing, searching, etc. Accordingly, data center operators must meet exponentially increasing demands for bandwidth without dramatically increasing the cost and power of the infrastructure. At the same time consumers' expectations of download/upload speeds and latency in accessing content provide additional pressure.

Accordingly, it would be beneficial to identify a means to reduce both the latency and microprocessor loading arising from the prior art IP multicasting techniques.

Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.

SUMMARY OF THE INVENTION

It is an object of the present invention to address limitations within the prior art relating to remote data multicasting and remote direct memory access and more particularly to exploiting optical multicasting, optical cross-connects, and optical fabrics within datacenters and data interconnection networks for low latency communications.

In accordance with an embodiment of the invention there is provided a method of routing data comprising:

-   providing a plurality M of first switches, each first switch for coupling to a plurality of electronic devices;
-   providing a second switch coupled to the plurality M of first switches;
-   interconnecting the second switch to the plurality M of first switches with a plurality of first optical links;
-   providing within a predetermined first optical link of the plurality of first optical links a first optical splitter providing a plurality of first outputs, wherein a first output of the plurality of first outputs forms part of the predetermined first optical link of the plurality of first optical links and the remainder of the plurality of first outputs are each coupled to predetermined first switches of the plurality of first switches.

In accordance with an embodiment of the invention there is provided a method comprising:

-   a first switch for coupling to a plurality of electronic devices comprising a plurality of first ports and a second port, each electronic device comprising a transmit port coupled to a predetermined first port of the first switch and a receive port coupled to a predetermined first port of the first switch; wherein
-   the plurality of transmit ports from the plurality of electronic devices are connected in parallel to the first switch;
-   the plurality of receive ports from the plurality of electronic devices are connected in parallel to the first switch; and
-   the second port of the first switch is coupled to an optical multicast module comprising a plurality of output ports, each output port coupled to a predetermined electronic device.

In accordance with an embodiment of the invention there is provided a network comprising:

-   a plurality M of first switches, each first switch for coupling to a plurality of electronic devices;
-   a second switch coupled to the plurality M of first switches;
-   a plurality of first optical links interconnecting the second switch to the plurality M of first switches; wherein
-   within a predetermined first optical link of the plurality of first optical links a first optical splitter provides a plurality of first outputs, wherein a first output of the plurality of first outputs forms part of the predetermined first optical link of the plurality of first optical links and the remainder of the plurality of first outputs are each coupled to predetermined first switches of the plurality of first switches.

In accordance with an embodiment of the invention there is provided a network comprising:

-   a first switch for coupling to a plurality of electronic devices comprising a plurality of first ports and a second port, each electronic device comprising a transmit port coupled to a predetermined first port of the first switch and a receive port coupled to a predetermined first port of the first switch; wherein
-   the plurality of transmit ports from the plurality of electronic devices are connected in parallel to the first switch;
-   the plurality of receive ports from the plurality of electronic devices are connected in parallel to the first switch; and
-   the second port of the first switch is coupled to an optical multicast module comprising a plurality of output ports, each output port coupled to a predetermined electronic device.

In accordance with an embodiment of the invention there is provided a method of multicasting comprising:

-   providing a passive optical cross-connect fabric;
-   providing a set of first nodes, each first node connected to an input port of the passive optical cross-connect fabric and transmitting on a predetermined wavelength of a set of wavelengths;
-   providing a set of second nodes, each second node connected to an output port of the passive optical cross-connect fabric;
-   transmitting data from a predetermined subset of the set of first nodes to a predetermined subset of the set of second nodes using a direct memory access protocol; wherein
-   all messages broadcast by each first node of the set of first nodes are broadcast to all second nodes of the set of second nodes.

In accordance with an embodiment of the invention there is provided a multicast fabric comprising:

-   a passive optical cross-connect fabric;
-   a set of first nodes, each first node connected to an input port of the passive optical cross-connect fabric and transmitting on a predetermined wavelength of a set of wavelengths;
-   a set of second nodes, each second node connected to an output port of the passive optical cross-connect fabric; wherein data transmitted from a predetermined subset of the set of first nodes to a predetermined subset of the set of second nodes employs a direct memory access protocol; wherein
-   all messages broadcast by each first node of the set of first nodes are broadcast to all second nodes of the set of second nodes.

Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein:

FIG. 1 depicts data center network connections according to the prior art;

FIG. 2 depicts a software-defined RDMA over Converged Ethernet (RoCE) multicasting architecture for downstream multicasting within a rack or in-between racks according to an embodiment of the invention;

FIGS. 3A and 3B depict a logical layout for an offload multicast methodology for use within a rack or in-between racks according to an embodiment of the invention;

FIG. 4 depicts a physical implementation of the offload multicast methodology according to an embodiment of the invention depicted in FIG. 3;

FIGS. 5A to 5C depict network interface controller (NIC) P2P with in-rack multicast according to an embodiment of the invention;

FIGS. 6A and 6B depict logical layouts for NIC P2P according to the embodiment of the invention depicted in FIG. 5;

FIGS. 7A and 7B depict physical layouts for NIC P2P according to the embodiment of the invention depicted in FIG. 5;

FIGS. 8 and 9 depict schematically data center interconnection configurations according to embodiments of the invention wherein data centers exploit optical multicasting for multicast TCP/IP communications within a three-dimensional (3D) architecture;

FIG. 10A depicts schematically the multicasting of large data objects within a data center according to the prior art;

FIG. 10B depicts schematically the multicasting of large data objects within a data center according to an embodiment of the invention;

FIG. 10C depicts schematically the multicasting of large data objects within a data center according to an embodiment of the invention;

FIG. 11 depicts a graph of the ratio of N^N to N factorial as a function of the number of ports of a network, N;

FIG. 12 depicts schematically an optical broadcast and select architecture supporting remote direct memory access over a passive optical cross-connect fabric enhanced with wavelength division multiplexing according to an embodiment of the invention;

FIG. 13 depicts schematically a 1:N fan-out crossbar of an optically distributed broadcast select switch according to an embodiment of the invention;

FIG. 14 depicts a comparison of software remote direct memory access (RDMA) over Converged Ethernet (RoCE) according to an embodiment of the invention with prior art software and hardware based RoCE;

FIG. 15 depicts schematically a scale out into multi-dimensions employing optically distributed broadcast select switches;

FIG. 16 depicts an optical micrograph of a proof-of-concept test bed employed in obtaining the results in FIG. 14; and

FIG. 17 depicts a classical binomial multicast according to the prior art.

DETAILED DESCRIPTION

The present invention is directed to remote data multicasting and remote direct memory access and more particularly to exploiting optical multicasting, optical cross-connects, and optical fabrics within datacenters and data interconnection networks for low latency communications.

The ensuing description provides exemplary embodiment(s) only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiment(s) will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It is to be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope as set forth in the appended claims.

Historically, datacenter interconnections for a given customer took the form of a few cross-connects measuring tens of meters within a single datacenter. As needs have arisen for resilient hyperscale datacenters, cross-connects have increased to several hundreds of meters within the datacenter and have been extended to several tens of kilometers across datacenters within the same metropolitan market. At the same time, customers today may employ public storage, commonly referred to as Public Cloud, or private storage, known as Private Cloud. Others may exploit a combination of both, known as Hybrid Cloud services. Others may employ multiple Public Cloud services in what is known as a Multi-Cloud environment. Still others may combine both a Hybrid Cloud and Multi-Cloud service combination, known as Hybrid Multi-Cloud (HMCloud). Accordingly, new functionalities are required in datacom networks in order to enable the capabilities sought by datacenter customers.

At the same time as supporting increased data flow, increased customer expectations and lower costs, no compromises can be made on the reliability of cloud computing communications that occur inside the datacenter, between datacenters and in the access of datacenters. To achieve what may be considered telecommunications-grade resiliency requirements, cloud computing vendors need to consider issues such as geographic failover and load balancing across multiple datacenter facilities within a given metro market.

It thus follows that the most successful datacenters will be those which also host seamlessly interconnected services from multiple diverse facilities within the same metropolitan market. In the past, it was sufficient to interconnect the datacenter-hosted enterprise cloud infrastructure with the one on the enterprise's premises. However, HM Clouds require multipoint connectivity with many more degrees of interconnection to allow multiple cloud providers to reach both the datacenter-hosted and the on-premises enterprise private datacenter. Further, the accessibility of wavelength division multiplexed (WDM) passive optical network (PON) technology allows links capable of interconnecting HM Clouds to span across multiple datacenters that can be several kilometers apart. Further, fiber optic network operators are now seeking to consolidate multiple smaller points of presence into larger datacenters in order to reduce their operational expenditures.

1. Managing Oversubscription to Control Costs in Two-Tier Leaf-Spine Architectures

The majority of hyperscale datacenter networks today are designed around a two-tier leaf/spine Ethernet aggregation topology leveraging very high-density switches such as the one depicted in FIG. 1. Within this two-tier leaf/spine topology, the oversubscription ratio is defined as the ratio of downlink ports to uplink ports when all ports are of equal speed. With 10 Gbps server interfaces, and considering these as part of a 3:1 oversubscribed architecture, 40 Gbps of uplink bandwidth to the spine switches is necessary for every 12 servers, i.e. 12×10 Gb/s=120 Gb/s of downlink bandwidth. The 3:1 threshold is today generally seen as the maximum allowable level of oversubscription and is both carefully understood and managed by datacenter operators. Accordingly, a 3:1 oversubscribed leaf/spine/core architecture is commonly deployed in order to support communications within and between a pair of data centres. As depicted, Data Centre A 110 and Data Centre B 120 generally consist of servers 130 interconnected by 10 Gbps links to Top of Rack (ToR) Ethernet switches that act as first level aggregation, the ToR leaf switches 140. These ToR leaf switches 140 then uplink at 40 Gbps into end of row (EoR) Ethernet switches, which act as the spine switches 150 of the leaf/spine topology. As an example, with a 48-port ToR switch of 10 Gbps per port, ensuring a maximum 3:1 oversubscription ratio requires that the ToR switches have 16 uplink ports at 10 Gbps or, alternatively, 4 ports at 40 Gbps. Then, in order to enable connectivity across datacenters, the spine switches connect at 100 Gbps to core routers 160, which in turn interconnect to an optical core infrastructure made up of metro/long-haul DWDM/ROADM transport platforms.
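The oversubscription arithmetic above generalizes directly. The following minimal Python sketch (illustrative only; the function and its parameters are not a standard API, and the example values are this document's figures) computes the uplink ports a leaf switch needs to hold a target oversubscription ratio:

```python
import math

def uplinks_required(downlink_ports, downlink_gbps, uplink_gbps, ratio):
    """Uplink ports needed so that downlink:uplink bandwidth <= ratio."""
    downlink_bw = downlink_ports * downlink_gbps   # total downstream Gb/s
    uplink_bw = downlink_bw / ratio                # upstream Gb/s required
    return math.ceil(uplink_bw / uplink_gbps)

# 48-port 10 Gbps ToR leaf at 3:1 oversubscription, per the figures above:
print(uplinks_required(48, 10, 10, 3))   # -> 16 uplink ports at 10 Gbps
print(uplinks_required(48, 10, 40, 3))   # -> 4 uplink ports at 40 Gbps
```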

Each leaf switch 140 must connect to every spine switch 150 in order to ensure that the network is never oversubscribed at any location beyond the chosen oversubscription threshold. By using such a network topology, and leveraging an equal cost multi-path (ECMP) protocol, it is then possible to have an equal amount of bandwidth across the aggregated path between the upstream and downstream, thereby providing a non-blocking network architecture via multiple aggregated links. It would be evident that the number of uplinks on the leaf switches 140 limits the number of spine switches 150 to which they can connect, whilst the number of downlinks on the spine switches 150 limits the number of leaf switches 140 that can be part of the overall network.

Consequently, the number of computer servers that can be added to a two-tier leaf/spine data center network architecture is a direct function of the number of uplinks on the leaf switches. If a fully non-blocking topology is provided, then the leaf switches are required to have as many uplinks as downlink interfaces to computer servers. Today, 10 Gbps is the default speed of the network interfaces of data center servers and hence, with the number of servers required to support the growth of Hybrid/Multi-Cloud services etc. requiring much larger and more centralized data centers, it has become challenging to design non-blocking and cost-effective data center networking fabrics.

Whilst this leaf/spine/core architecture is the most pervasive manner of providing any-to-any connectivity with a maximum amount of bisection bandwidth within and across data centers, it is not without its limitations. One such limitation is latency, due to the requirement to route via at least one leaf switch, or more typically via two leaf switches and two or more spine switches and/or core routers, according to the dimensions of the data center, the uplink capacity, downlink capacity, location(s) of the servers being accessed, etc. Accordingly, within the prior art alternative architectures have been proposed, such as chordal networks and spine ring networks. Considering the former, a 32 node chordal ring network is formed from 32 EoR spine switches in a ring wherein each spine switch is addressed from another spine switch by the selection of the wavelength upon which the data is transmitted. Accordingly, the number of spine switches/core switches traversed may be reduced through Dense Wavelength Division Multiplexing (DWDM) based chordal ring architectures as, rather than routing data through multiple spine and/or core switches, the data is routed from a node based upon wavelength, wherein the N^(th) wavelength denotes the N^(th) node around the ring.

Other prior art developments to address the drawbacks within two-tier leaf-spine networks have included the addition of direct connectivity between spine switches, rather than requiring routing via a core router, and the provisioning of increased connectivity between leaf switches and spine switches such that each leaf switch is connected to multiple spine switches. However, within data center inter-connection networking scenarios these approaches maintain centralized switching functionality requiring extra network links to be traversed, commonly referred to as increasing the number of hops, which in turn increases latency, increases cost, and increases power consumption: three key factors that cloud data storage providers and data center operators are seeking to lower. Accordingly, it would be evident that solutions to reduce latency and increase effective transmission capacity would be beneficial within data center environments as well as other environments. One such solution is the provisioning of broadcast (or multicast) capabilities within a network, such as a data center exploiting Internet Protocol (IP) based communication methodologies. Another solution, as will be described subsequently below, is the provisioning of intermediate multicast layers to bypass routing to the higher layer spine and core switches.

As noted supra, network-intensive applications like networked storage or cluster computing require a network infrastructure which provides high bandwidth and low latency. Accordingly, systems today send data over a network using the Internet Protocol, where data is sent in fixed-length data records, commonly referred to as packets, which comprise a "header" followed by a "data section". To ensure that all the packets that get sent arrive at their destination, IP links commonly exploit the Transmission Control Protocol (TCP), which runs on top of IP and takes care of the overhead processes of making certain that every packet sent arrives and of splitting/joining the "continuous stream" of bytes to/from the packets. Accordingly, within data centers exploiting Ethernet links in the prior art, TCP/IP is the common link format.

As TCP is a "connection oriented protocol", prior to exploiting it the system must first "establish a connection", with one program taking the role of a "server" and another program taking the role of a "client". The server will wait for connections, and the client will make a connection. Once this connection has been established, data may be sent in both directions reliably until the connection is closed. In order to allow multiple TCP connections to and from a given host, "port" numbers are established. Each TCP packet contains an "origin port" and a "destination port", which are used to determine which program running under which of the system's tasks is to receive the data. An application process is thus identified overall through the combination of the IP address of the host it runs on (or, more precisely, the network interface over which it is talking) and the port number which has been assigned to it. This combined address is called a socket.

Internet socket APIs are usually based on the Berkeley sockets standard. In the Berkeley sockets standard different interfaces (send and receive) are used on a socket. In inter-process communication, each end will generally have its own socket, but as these may use different application programming interfaces (APIs) they are abstracted by the network protocol.
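As a concrete illustration of the server/client and socket concepts above, a minimal Berkeley-sockets example in Python follows (a sketch only; the host and port values are hypothetical, and the two functions would run in separate processes):

```python
import socket

def run_server(host="127.0.0.1", port=50007):
    """Server: waits for a connection on its (IP address, port) socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))       # the (IP address, port) pair names the socket
        srv.listen(1)
        conn, addr = srv.accept()    # block until a client connects
        with conn:
            data = conn.recv(1024)   # reliable, ordered bytes courtesy of TCP
            conn.sendall(data)       # echo the data back

def run_client(host="127.0.0.1", port=50007):
    """Client: makes the connection; data may then flow in both directions."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, port))
        cli.sendall(b"hello")
        print(cli.recv(1024))        # -> b'hello'
```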

In contrast to TCP/IP, a datacenter may exploit Remote Direct Memory Access (RDMA), which is a direct memory access from the memory of one computer into that of another without involving either computer's operating system. RDMA permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters, and typically offers lower latency, lower CPU load and higher bandwidth than TCP/IP. The exploitation of the RDMA over Converged Ethernet (RoCE) protocol allows even lower latencies to be achieved than earlier RDMA protocols.

Accordingly, whilst historically communications links were primarily TCP/IP based with small levels of RDMA supported by the server's NIC, the inventors define their links as being primarily RDMA based with small levels of TCP/IP supported by the server's NIC. A small level of TCP/IP communications remains, as not all communications from a server within a datacenter will be direct transfers to another server within a datacenter, since overall management functions and some data transfers will be to an application through a socket. Exploiting RoCE for the RDMA processes allows communication between any two hosts within the same Ethernet broadcast domain.

Accordingly, referring to FIG. 2 there is depicted a software defined RDMA multicast within a rack and in-between racks according to an embodiment of the invention. There are depicted first and second racks 200A and 200B, each comprising an array 210 of 16 servers in conjunction with a Top-of-Rack leaf switch (ToR-LS) 220. Each ToR-LS 220 is coupled to an End-of-Row (EoR) switch (ER-SW) 240 via dedicated links such that the ER-SW 240 has a first transmitter (Tx) 250A and first receiver (Rx) 260A assigned to each ToR-LS 220, which itself has a second Tx 250B and second Rx 260B. As depicted in first rack 200A, the ToR-LS 220 communicates with each server (not identified for clarity) via a dedicated downlink and dedicated uplink as known within the prior art. Accordingly, in the event of a multi-cast (MC) message being transmitted by a server within first rack 200A, this is received at the ToR-LS 220 wherein it is both transmitted to the ER-SW 240 via an RoCE transmission and to each server within the first rack 200A. As the first rack 200A exploits dedicated downstream links, the ToR-LS 220 employs a software based RoCE process to replicate the MC message and provide it into the memory of each server within the first rack 200A over the dedicated link to each server. Accordingly, the Soft-RoCE is performed by the ToR-LS 220. Similarly, ER-SW 240 executes a Soft-RoCE process for the received MC message to replicate and transmit the MC message to each ToR-LS 220 of the other racks it is coupled to via optical links.

In contrast, second rack 200B similarly exploits dedicated upstream links from each server to the ToR-LS 220 and a series of dedicated downstream links, but it provides for an overlay optical MC via MC transmitter 230 and links to each server within the rack 210 of second rack 200B. Accordingly, the ToR-LS 220, upon receiving an MC message, rather than transmitting this to each server via a software replication process with RDMA (e.g. software-defined RDMA), provides this to the MC transmitter 230 wherein it generates an optical MC message which is passively split and coupled to each server. Accordingly, it would be evident that a range of options exist for providing the MC message in conjunction with the non-MC messages provided from the ToR-LS 220. These include, but are not limited to:

-   defining a time-slot for MC messages so that the MC transmitter 230 occupies a time-slot or time-slots without conflicting with non-MC messages, wherein the MC transmitter may operate in the same wavelength band as the non-MC messages;
-   the MC messages may be upon a separate wavelength with a separate receiver within each server coupled via different routing between ToR-LS 220 and MC transmitter 230; and
-   the MC messages may be upon a separate wavelength with a separate receiver within each server multiplexed over a common path from the ToR-LS 220.

Now referring to FIG. 3A there is depicted a logical layout for an offload multicast methodology for use within a rack or in-between racks according to an embodiment of the invention. Accordingly, as depicted, a plurality of nodes 380 are connected in a tree to a transponder 320, each transponder 320 being connected to an optical fabric 310. Accordingly, the optical fabric 310 and transponders 320 provide connectivity, what the inventors refer to as Hyper Edge connectivity, between racks, whilst the nodes, e.g. servers within a rack, are connected in a binomial multicast configuration within a rack. The Hyper Edge provides for an offloading of the multicast between racks. This is achievable as concepts such as Multicast Service Function Tree (MSFT) are flexible enough to support a software/hardware hybrid solution for multicasting. Within other embodiments of the invention the ToR switch within each rack may also support 1:N or K:N lossy multicast.

Referring to FIG. 3B there is depicted a logical layout for an offload multicast methodology for use within a rack or in-between racks according to an embodiment of the invention. As depicted, an optical fabric 310 is coupled to a plurality of transponders 320, each coupled to a rack assembly 330. Within the rack assembly 330 is a SP-SW 340 which is coupled to a plurality of ToR-LS 350, each of which is coupled to a plurality of servers within first rack 360. The servers (not identified for clarity) are depicted within a binary tree configuration with respect to the ToR-LS 350 rather than being in parallel through discrete connections to the ToR-LS 350 such as depicted in second rack 370. The optical fabric 310 couples each optical transmitter within the transponders 320 to each optical receiver within the transponders 320.

Accordingly, the offload multicast methodology according to an embodiment of the invention exploits a plurality of transponders, an optical fabric and software installed upon all servers. The transponders are linked to existing commodity ToR switches, for example with a 10 Gb/s port for outgoing data and multiple 10 Gb/s ports for incoming multicast. The servers send the outgoing multicast package to a transponder via the Transmission Control Protocol (TCP). The transponders send multicast traffic to all other racks through the "lossless" optical fabric. Accordingly, data packages are copied multiple times through the optical fabric, e.g. 24 or 32 times, to each other rack. The transponders within the receiving racks pick the data packages for that rack, which are then sent to the servers via TCP links. Each server directly manages the binomial copy to the remainder of the servers. Optionally, if the ToR supports lossy multicast then the transponder can also employ that to send packages to servers, and use Negative Acknowledgement (NACK) and TCP to deliver missed packages.
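A minimal sketch of this offload flow is given below (illustrative Python with a simulated fabric in place of real TCP links and optics; the rack and server counts are assumptions, not values fixed by the text):

```python
NUM_RACKS, SERVERS_PER_RACK = 32, 16      # assumed cluster dimensions

class Fabric:
    """Passive optical fabric: one input is copied to every rack."""
    def __init__(self):
        self.transponders = []
    def broadcast(self, package):
        for t in self.transponders:       # "lossless" one-to-all copy
            t.deliver(package)

class Transponder:
    def __init__(self, rack_id, fabric):
        self.rack_id, self.fabric = rack_id, fabric
        self.rx_queues = [[] for _ in range(SERVERS_PER_RACK)]  # per-server queues

    def send(self, package):
        self.fabric.broadcast(package)    # outgoing multicast, received via TCP

    def deliver(self, package):
        # The receiving transponder hands the package to a first server; the
        # servers then perform the binomial copy among themselves.
        self.rx_queues[0].append(package)

fabric = Fabric()
fabric.transponders = [Transponder(r, fabric) for r in range(NUM_RACKS)]
fabric.transponders[0].send(b"large-object-chunk-0")
assert all(t.rx_queues[0] for t in fabric.transponders)   # every rack got a copy
```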

Within a rack the performance is limited by the ToR and the binomial algorithm. Between racks, the hardware according to an embodiment of the invention such as described and depicted in FIGS. 3A and 3B enables a one-to-all copy for all other racks, e.g. 24 or 32 racks, within a timeframe of approximately 200 ns. However, these will be delayed by factors such as buffering. The lossy ToR hardware multicast is beneficial where the loss ratio is predictably low.

In the scenario where a package is sent from one server to all, the server sends the package via TCP through the ToR to the transponder, which typically takes less than 1 μs. Then the optical fabric copies this to all other transponders, which takes less than 200 ns. Next, the transponder uses the ToR to do the lossy copy, which typically takes less than another 1 μs. Accordingly, it is evident that the architecture depicted within FIGS. 3A and 3B results in a package being transmitted to any subset of servers, or all servers, in well under the typical figure of 8 μs that typical server clusters achieve today with prior art solutions.
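Summing the stages quoted above gives a quick latency budget (a back-of-envelope check using the text's figures, not a measurement):

```python
# One-to-all delivery budget, per the stages described above (microseconds).
server_to_transponder_us = 1.0   # TCP through the ToR to the transponder (< 1 us)
fabric_copy_us = 0.2             # optical one-to-all copy (< 200 ns)
transponder_to_servers_us = 1.0  # lossy copy back through the ToR (< 1 us)

total_us = server_to_transponder_us + fabric_copy_us + transponder_to_servers_us
print(total_us)                  # -> 2.2 us, well under the ~8 us prior art figure
```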

Within embodiments of the invention the number of drop ports, N, may be greater than 1 as the architecture enables multiple multicast. Accordingly, more than one of the transponders can send large data objects (objects) at the line rate to all transponders. These transponders are capable of taking them all and then aggregating the data packages to the N output ports. These are then sent via the ToR to the target servers via TCP or lossy User Datagram Protocol (UDP). This can be beneficial in a first scenario where the rack has only one receiving group for the same package, wherein the transponder can use all N ports to copy the same package to N sub-groups of servers. It would also be beneficial where the rack has N receiving groups for different packages, as the transponder can send to them all via different ports simultaneously.

The alternative to multicast is point-to-point (P2P). P2P is a pure software solution offering reasonable performance. However, it takes almost the whole network to do that one job. Even with 2:1 over-provisioning the performance is still non-deterministic and poor, even though the cost of the network has doubled. For large numbers of nodes, the statistical fluctuations can be totally out of control and drive the performance to be very poor.

The server factor makes the value proposition of multicast more compelling. Within a P2P based multicast, all servers in the tree are either occupied or wasted. With a fully blocked network interface card (NIC), how can a server transmit? Adding additional NICs as noted above has significant cost impacts but does not solve the problem, and is a major impact on costs versus the addition of more ToR/Spine switches. Accordingly, the embodiments of the invention provide benefits to operators including, for example, low latency and uneven (much higher) receiving bandwidth, reduced cost, reduced power consumption, and compatibility with the current network.

FIG. 4 depicts a physical implementation of the offload multicast methodology according to an embodiment of the invention depicted in FIG. 3. As depicted, first and second racks 400A and 400B are coupled to optical fabric 450 via transmitters 430 and receivers 440, respectively. First rack 400A comprises a ToR-LS 420 with rack 460 wherein the servers (not identified discretely for clarity) are connected to the ToR-LS 420 via discrete transmit and receive channels in parallel. In second rack 400B the ToR-LS 420 is again coupled to the discrete servers within the rack via discrete channels on the transmit side to the ToR-LS 420. However, on the receive side the servers are connected from the ToR-LS 420 via an optical multicast network 410, as opposed to discrete parallel channels and/or parallel MC multicast such as depicted in respect of FIG. 3.

Referring to FIG. 5A there is depicted a network topology prior to deploying a network interface controller (NIC) P2P with in-rack multicast according to an embodiment of the invention. As depicted, a plurality of server racks 540 are coupled to ToR-LS 550 via discrete channels 545. Each ToR-LS 550 is coupled to an EoR-SW 530 (or spine switch) which are in turn coupled to core switches (COR-SW) 520 and network router (NET-R) 510. Links between tiers of the hierarchy are via dedicated links, with the upstream/downstream bandwidths and the number of subsidiary elements in a lower tier per element in a higher tier defined by the subscription ratio. These links are differentiated between inner links 580 and outer links 570.

Now referring to FIG. 5B, the network topology depicted in FIG. 5A is repeated but now each outer link 570 has been replaced with a multicast network (MC-NET). Accordingly, the outer link 570 between NET-R 510 and COR-SW 520 is replaced by first MC-NET 590A, the outer link 570 between the COR-SW 520 and EoR-SW 530 is replaced by second MC-NET 590B, and the outer link 570 between the EoR-SW 530 and ToR-LS 550 is replaced by third MC-NET 590C. Each of the first MC-NET 590A, second MC-NET 590B, and third MC-NET 590C accordingly multicasts the signal from the higher level to all corresponding elements in the lower tier. Within FIG. 5C the network topology is depicted within another variant of that depicted in FIG. 5B wherein the first MC-NET 590A, second MC-NET 590B and third MC-NET 590C are again disposed between their respective tiers within the network; however, rather than replacing the outer links 570, they are disposed in addition to the outer links 570 and inner links 580.

Referring to FIGS. 6A and 6B there are depicted logical layouts for NIC P2P according to the embodiment of the invention as depicted in FIG. 5B and a variant thereof. Within FIG. 6A the NET-R 510, COR-SW 520, and EoR-SW 530 are depicted within their tiers with the outer links 570 and inner links 580. Each tier boundary, except that between NET-R 510 and COR-SW 520, has all outer links 570 replaced with an MC-NET 600, which connects to each element within the lower tier, including the element coupled to the higher tier via an inner link 580. The logical network in FIG. 6B is essentially the same as that depicted in FIG. 6A, with the exception that those elements within the lower tier connected to an element within the upper tier via an inner link 580 are not connected to an MC-NET 600, whereas all other elements in the lower tier are connected to the MC-NET 600.

Now referring to FIGS. 7A and 7B there are depicted physical layouts for NIC P2P according to the embodiment of the invention depicted in FIG. 5. In FIG. 7A an upper layer 710 is coupled to a lower layer 730 via a plurality of MC-NETs 720 which are grouped in dependence upon their port counts. As depicted, first and second MC-NETs 720A and 720B are coupled to first to fourth rack arrays 730A to 730D, wherein each of the first and second MC-NETs 720A and 720B is connected to all of the first to fourth rack arrays 730A to 730D. In contrast, in FIG. 7B multiple MC-NETs 720 are replaced with a larger MC-NET 740.

With respect to the performance of the exemplary embodiments of the invention depicted in FIGS. 3A to 7B, there is still the issue of performance at massive scale, as bottlenecks will potentially be caused by delays in the receivers setting up to receive the next message. For example, consider a network with a million nodes. If even one is not ready to receive the next packet, then the whole cluster waits, because the sender cannot send until the "ready" bit aggregates and is visible to the NIC. So this ultimately becomes the limiting factor. Accordingly, it would be beneficial to ensure that the receivers are aware of the full transfer so that they can loop receiving and do not have to set up separately for each request.

Further, it would be beneficial to use fixed block sizes, particularly fairly large blocks, in multiples of a page, and to page-align them. Whilst the receivers do not care where the data appears in memory, as they just want the pages, it would be beneficial to remap them in order to make them contiguous in the application. Further, it would be beneficial for the network to support multiple concurrent senders for different distinct uses so that when, rather than if, a cluster is delayed for transfer A, transfer B is also underway, and on average the network is active and busy. A special case exists for small transfers; unlike video, for example, which has huge streams of data at steady speeds, real-time "events" are often encoded in data objects that might be just 16 or 32 bytes, but the event rate could be massive.
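One way to realize the pre-posted, page-aligned receive buffers suggested above is sketched below (illustrative Python; the ring depth and block size are assumptions, and `receive_into` is a hypothetical callback standing in for an RDMA receive queue):

```python
import mmap

PAGE = mmap.PAGESIZE      # align blocks to the page size, as suggested above
BLOCK = 16 * PAGE         # fixed, fairly large block: a multiple of a page
RING_DEPTH = 64           # buffers pre-posted up front, so the receiver never
                          # has to set up separately for each message

# mmap returns page-aligned memory, so every BLOCK-sized slice is page-aligned.
region = mmap.mmap(-1, RING_DEPTH * BLOCK)
ring = [memoryview(region)[i * BLOCK:(i + 1) * BLOCK] for i in range(RING_DEPTH)]

def receive_loop(receive_into, total_blocks):
    """Loop receiving into pre-posted buffers for a transfer of known size."""
    for i in range(total_blocks):
        buf = ring[i % RING_DEPTH]   # reuse buffers round-robin
        receive_into(buf)            # hypothetical: fill buf from the fabric
        # received pages can later be remapped to be contiguous in the
        # application, since the receiver does not care where they land
```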

Now referring to FIG. 8 there is depicted schematically a data center interconnection configuration 800 according to an embodiment of the invention wherein data centers exploit optical multicasting for multicast communications. Accordingly, the configuration employs arrays of data centers, Data Center A 440 and Data Center B 460, each having memory and storage associated with it, wherein these data center arrays each represent one tier of R tiers. Within each tier data centers are connected through Torus A 870 and/or Torus B 880, although they may optionally include portions of first Data PDXN (Hyperedge/AWGR 1) 810 and/or second Data PDXN (Hyperedge/AWGR 2) 830 along that tier's edge. Data centers across multiple tiers are connected through Torus A 870 and/or Torus B 880 in conjunction with the first Data PDXN (Hyperedge/AWGR 1) 810 and/or second Data PDXN (Hyperedge/AWGR 2) 830. However, there are now depicted a pair of Hyperedge MC PDXNs 820 which are coupled to two edges of each tier in common with first Data PDXN (Hyperedge/AWGR 1) 810 and second Data PDXN (Hyperedge/AWGR 2) 830. As such, each data center may now exploit the low latency optical multicast methodology such as described supra in respect of FIGS. 3 to 7B in order to provide multicast data communications to multiple data centers within the three-dimensional (3D) array of data centers.

In common with the embodiments of the invention described supra in respect of FIGS. 3 to 7B, the optical multicast through the Hyperedge MC PDXN 820 is coupled to the data centers on an outer edge of each tier. It would be evident that, rather than data centers, each of Data Center A 440 and Data Center B 460 may be a cluster of racks, a row of racks, a discrete rack, or variants thereof, as well as complete datacenters or discrete servers, and accordingly a ToR-LS, an EoR-SW, a COR-SW, or a NET-R. It would also be evident that other configurations of data centers within each plane of the three-dimensional (3D) array of data centers may be employed without departing from the scope of the invention. For example, a hexagonal configuration may be employed with optical multicast along three faces of the 3D array, or a single optical multicast may be deployed axially to all tiers wherein within a tier multiple tori are employed.

Now referring to FIG. 9 there is depicted schematically a data center interconnection configuration 900 according to an embodiment of the invention wherein data centers exploit optical multicasting for multicast communications. Accordingly, configuration 900 comprises the same architecture and architectural elements as configuration 800 in FIG. 8, but now an optical multicast network 910 interconnects all data centers, e.g. Data Center A 440 and Data Center B 460, within a single tier.

It would be evident with respect to FIGS. 8 and 9 that, alternatively, rather than considering each tier as comprising an array of data centers, the elements Data Center A 440 and Data Center B 460 may represent EoR or ToR switches within a single data center, and the tiers may represent different data centers and/or different regions within a common single data center.

According to embodiments of the invention, whilst a single layer optical multicast is depicted, it would be evident that through the use of distributed optical amplification multi-tier optical multicasting may be employed for optical multicasting within a data center. For example, an initial 32-way multicast layer to ToR-LS switches may be followed by a second tier 16-way multicast layer to the servers within the rack. Such multi-tier multicasting may be achieved within a multi-tier architecture by associating, for example, an input and output port of an MC-NET disposed between the ToR-LS and EoR-LS to the COR-SW above, such that latency through the EoR-LS is removed.

Whilst within the embodiments of the invention described supra the MC-NETs are disposed within the outer links, it would be evident that alternatively the MC-NETs may be coupled to a predetermined inner link within each set of links between a lower tier and a higher tier.

Now referring to FIG. 10A there is depicted schematically the multicasting of large data objects within a data center according to the prior art. Accordingly, within a rack a plurality of servers 1020 are connected in a "daisy-chain" configuration to a Switch A 1010, e.g. a ToR switch. The Switches A 1010 are connected to a first level of switches, Switch B 1020, which route communications from one rack to another. A second level of switches, Switch C 1030, routes communications between the first level of switches. However, when a server 1020 wishes to replicate a software load across a data center, for example, TCP/IP is inefficient and accordingly RDMA could be used. RDMA deposits directly in memory, whilst RDMA over Ethernet does so over a network. With RDMA over Ethernet the distributed storage results in higher Input/Output Operations Per Second (IOPS) versus TCP/IP. A chained software replication is employed, as software multicasting is inefficient due to cascaded packet loss.

Accordingly, the inventors have addressed this problem by leveraging:

-   high performance software RDMA on a standard NIC to provide an easy entry point for customers to provide low latency close to hardware solutions;
-   maintaining the data in the user space, which hardware solutions do not; and
-   employing an optical overlay using a second standard NIC in each server.

This approach allows for future enhancements through higher performance NICs and/or enhanced optics.
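For contrast with the chained/binomial software replication described above (depicted classically in FIG. 17), a short sketch of a binomial multicast schedule, in which the set of informed nodes doubles each round, is given below (illustrative Python, not the inventors' implementation):

```python
import math

def binomial_rounds(n):
    """Rounds for one sender to reach n nodes when informed nodes relay."""
    return math.ceil(math.log2(n)) if n > 1 else 0

def binomial_schedule(n):
    """(round, sender, receiver) triples: informed nodes double each round."""
    sends, informed, rnd = [], 1, 0
    while informed < n:
        for src in range(min(informed, n - informed)):
            sends.append((rnd, src, informed + src))
        informed *= 2
        rnd += 1
    return sends

print(binomial_schedule(4))   # -> [(0, 0, 1), (1, 0, 2), (1, 1, 3)]
print(binomial_rounds(16))    # -> 4 software hops for a 16-server rack,
                              #    versus a single passive optical split
```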

Now referring to FIG. 10B there is depicted schematically the multicasting of large data objects within a data center according to an embodiment of the invention employing an optical splitter to distribute signals from the ToR switch, Switch A 1010, via optical fibers 1050 to an additional NIC added to each server 1020. Alternatively, as depicted in FIG. 10C, there is depicted schematically the multicasting of large data objects within a data center according to an embodiment of the invention using an optical bus 1060 with optical links 1070 from the optical bus 1060 to the additional NICs within the servers 1020.

2. RDMA Over Passive Optical Cross-Connect Fabric

Within the preceding discussion in respect of FIGS. 2 to 9 and FIGS. 10B and 10C the discussions and analysis have been directed to data centers and the connectivity of servers within racks and between racks. Within these an additional optical fabric is employed to distribute data in order to reduce the latency of the communications between racks etc. Multicast, or more specifically reliable multicast, is an important communication primitive and building block in the architecture of scalable distributed systems. However, implementing reliable multicast at large scale has to date been challenging due to limitations with existing switch fabrics and transport-layer protocols. These switch fabrics and transport-layer protocols are primarily designed for point-to-point (unicast) communications, which have insufficient permutations to support low loss-ratio multicast. So, in practice, to date reliable multicast communications are implemented as a software overlay on top of the unicast network.

Multicast communications consume significant resources which scale nonlinearly with the number of endpoint nodes, often requiring implementations to make trade-offs between latency, reliability guarantees, and scalability. For example, multicast applications range diversely from live multimedia events which are broadcast to many subscribers (nodes), potentially millions of subscribers, in which strict reliability is not a critical requirement but timeliness is, to distributed file systems where real-time performance is less critical but data integrity is. However, many applications such as distributed computing are time-sensitive applications in cloud computing and other distributed systems requiring high availability, strong consistency, and low latency. These emerging applications are being fueled by new technologies like Network Virtualization, 5G, the Internet of Things (IoT), high performance computing (HPC) and artificial intelligence.

The inventors believe that a reliable multicast technique with low intrinsic latency and the ability to scale is an important building block required to address the challenges posed by these time-sensitive applications. Furthermore, it could also play an important role in Byzantine Fault Tolerant protocols, which are becoming more appealing as users of data and applications are increasingly more susceptible to malicious behaviour.

However, even if we assume the switch fabric itself can be made lossless, the networking interface and protocol stack at each node's memory and central processing unit (CPU) still introduce packet drops. This can arise for many reasons, ranging from insufficient allocation of buffers to the processor's inability to keep up with the rate of packet arrival and transmission. Multicast traffic would only exacerbate these issues, as outlined below.

2.1 Packet Loss Challenges of Multicast and Proposed Scalable Solutions

Within a cluster of networking nodes, packets sent out from the sender's CPU go through the transmitting protocol stack layers, traverse the switch fabric, and finally move up the receiving protocol stack layers before they reach the receiving side's CPU. Along this path, packets could be dropped due to reasons such as traffic congestion, insufficient allocation of buffers, or blocking in the switch fabric. This could happen at many points within the sender's stack, the switch fabric, as well as the receiver's layer-2, 3, and 4 (L2, L3, and L4) buffers.

Most switch fabrics (especially for Ethernet) are not designed to be lossless even for unicast traffic. In addition, the Ethernet/IP/TCP and UDP stacks were designed as best-effort and hence cannot guarantee delivery of packets. However, to achieve a reliable multicast at a line rate of 10 Gb/s and beyond, the loss ratio required is lower than one in a billion. Accordingly, the inventors have addressed this through a combination of an optical switch fabric and the RDMA stack.

2.1.A Tackling Packet Loss in the L1 Switch Fabric

Multicast communication transmits information from a single source to multiple destinations. Although it is a fundamental communication pattern in telecommunication networks as well as in scalable parallel and distributed computing systems, it is often difficult to implement efficiently in hardware at the physical layer (L1).

Building large scale switch fabrics is challenging even for unicast (point-to-point) connections. Consider an N×N switch to represent the switch fabric and consider the permutations of connections needed among inputs and outputs. For a non-blocking switch (also called a perfect switch), the number of permutation assignments (maximal sets of concurrent one-to-one connections) needs to be N! (N factorial), with the number of cross points scaling as N^2 (N squared). When N becomes large, this crossbar switch is difficult and expensive to scale, so the switch fabric is usually implemented in a multistage switching configuration using a Clos switch or some variation thereof. FIG. 11 depicts a graph of the ratio of N^N to N factorial as a function of the number of ports of the network, N.

The interconnections between the internal switch stages further increase the number of potential congestion points that can lead to package drops. Furthermore, even though the full Clos configuration is strictly non-blocking for unicast traffic, oversubscription is often introduced in some of the switching stages for cost reasons, further increasing the probability of congestion and package loss within the switch fabric.

When used in a packet-switched context for point-to-point (unicast) traffic, a perfect switch will ensure that no packet is lost within the switch itself. Packets can still be lost outside the switch if there is congestion before or after the switch, which can cause the ingress and egress buffers to overrun.

In the presence of multicast traffic, things get more challenging. In this case, the crossbar switch is no longer internally non-blocking, since the number of multicast assignments needed to support arbitrary multicast is N^N, which is significantly larger than N! (N factorial). Furthermore, multicast traffic can exacerbate congestion issues, especially at the switch egress buffers, since packets from many sources can be directed to the same destination output port (incast).

It is not difficult to see that the number of multicast assignments needed rapidly outgrows the number of available permutation assignments, even for a relatively small port count. For example, as seen in FIG. 11, even at N=16 we would need almost 900,000 times more assignments than what is available on the perfect switch.
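The gap is easy to reproduce; a short Python check of the N^N-to-N! ratio plotted in FIG. 11 (a sketch using only the standard library):

```python
import math

def multicast_to_unicast_ratio(n):
    """Arbitrary multicast assignments (N^N) per permutation assignment (N!)."""
    return n**n / math.factorial(n)

for n in (4, 8, 16):
    print(n, f"{multicast_to_unicast_ratio(n):,.0f}")
# n=16 -> ~881,658: the "almost 900,000 times" cited above for FIG. 11
```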

This implies that performing multicast directly using existing switch hardware will quickly lead to blocking and loss of information, making low-loss-ratio multicast challenging, and practically impossible. It is therefore not surprising why multicast in today's distributed systems is often implemented in software as an overlay on top of the unicast switch hardware.

To overcome the aforementioned hardware limitation, the inventors have successfully implemented a key physical-layer (L1) building block device based on a passive optical cross-connection network (PDXN) by using an N×N optical coupler fabric. Optical power from each input is divided equally among the N outputs so that no reconfiguration is needed to set up a circuit between an input and an output. Since this architecture supports multicast, it can also support unicast. However, if used primarily for unicast traffic, this architecture could be expensive.
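Since each input's power is divided equally N ways, the intrinsic splitting loss grows as 10·log10(N) dB per path. A quick Python budget check follows (a sketch; the transmitter power, receiver sensitivity and excess loss figures are assumptions, not values from the text):

```python
import math

def splitting_loss_db(n):
    """An ideal 1:N equal power split costs 10*log10(N) dB per path."""
    return 10 * math.log10(n)

# Assumed illustrative figures:
tx_dbm, rx_sensitivity_dbm, excess_db = 0.0, -24.0, 3.0

for n in (16, 32, 64, 80):
    margin = tx_dbm - splitting_loss_db(n) - excess_db - rx_sensitivity_dbm
    print(f"N={n}: split loss {splitting_loss_db(n):.1f} dB, margin {margin:.1f} dB")
# Larger port counts consume the link budget, which is why, as noted below,
# optical amplifiers may be used within the fabric at higher port counts.
```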

Accordingly, referring to FIG. 12 there is depicted schematically what the inventors refer to as an Optical Distributed Broadcast-Select Switch supporting RDMA over a passive optical cross-connect fabric enhanced with wavelength division multiplexing according to an embodiment of the invention. Within FIG. 12 only 4 channels are depicted for simplicity. On the left, four transmitters 1210A to 1210D each transmit upon a different wavelength, λ1 to λ4, to a different input port of the Optical Broadcast Select 1200. A broadcast stage 1220, for example an optical star coupler, couples each input of the broadcast stage 1220 to each output of the broadcast stage 1220 such that each output now carries all 4 wavelengths, λ1 to λ4. Each output is coupled to a select stage 1230 comprising four opto-electronic stages 1240A to 1240D respectively, each of which is coupled to a receiver 1250A to 1250D.

The original PDXN design was combined with a Time Division Multiple Access (TDMA) protocol. However, the PDXN architecture can also be used as an Optical Distributed Broadcast-Select Switch (ODBSS) when enhanced by WDM, as shown in FIG. 12. To do so, we assign each port a dedicated optical transmitter wavelength. At each destination port end, an optical demultiplexer followed by an array of photodetectors can be used to implement the receiver function. In this way, the PDXN fabric works in a distributed broadcast-and-select mode, with every port being able to broadcast to any port, and the receiving port able to select the wavelength it would like to pick up.

Due to the wide and inexpensive bandwidth available in the optical fiber medium, this optical-based architecture can work in a distributed manner. Unlike the old-fashioned electronics-based design, which has to complete the selection job within a central switch chip, channel selection in an optical-based design can be delayed to the end-points, making it much easier to align with end-point subscription policies. This architecture has N^3 interconnections inside, which can support N^N permutations.

One familiar with switch fabric architectures would notice the similarity between an ODBSS and a crossbar with fan-out. In fact, the ODBSS design can be considered as a crossbar with full 1:N fan-out, which has N^N permutations, as shown in FIG. 13. By being able to achieve a full fan-out, the ODBSS is capable of offering arbitrary multicast with N^N permutations within. As depicted in FIG. 13, a set of transmitters Tx1 to Tx5 1310A to 1310E respectively form a series of optical busses which are "tapped" and coupled to a set of selectors 1330A to 1330E, respectively, these being implemented by optical fibers 1320, for example. Each selector couples to a receiver, first to fifth receivers 1340A to 1340E, respectively. The matrix of "taps" is referred to as a Broadcaster 1350.
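A toy model of this broadcast-and-select behavior (illustrative Python; the wavelength names and payloads are hypothetical) shows how channel selection is deferred to the receiving end-points:

```python
# Toy ODBSS: every transmitter broadcasts on its own dedicated wavelength;
# every output port carries all wavelengths and each receiver selects one.
transmitters = {f"lambda{i}": f"payload-from-port-{i}" for i in range(1, 5)}

def broadcast(fabric_inputs):
    """Passive broadcaster: each output port carries every input wavelength."""
    return dict(fabric_inputs)      # identical spectrum at each of the N outputs

output_port = broadcast(transmitters)

def select(port_spectrum, wavelength):
    """Select stage: demultiplexer plus photodetector picks one channel."""
    return port_spectrum[wavelength]

print(select(output_port, "lambda3"))   # receiver tuned to lambda3
# Any subset of receivers may select the same wavelength simultaneously,
# which is what yields arbitrary (N^N) multicast patterns.
```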

In today's widely-deployed commercial optical modules, an 80 wavelength-channel system based on DWDM (Dense Wavelength Division Multiplexing) is already practical. Accordingly, these architectures can support up to 80 ports using the ODBSS fabric directly, or, with a larger port count, optical amplifiers can be used within the fabric to compensate for the higher losses and maintain a suitable link budget. The inventors note that the maturity of the optical component and module industry has led to a dramatic cost reduction over the last two decades. Therefore, such a device can be built out of cost-effective, off-the-shelf optical modules and components.

2.1.B Tackling Packet Loss in Receiving Buffers

Buffer misalignment in communication stacks is another major factor in the failure to achieve low loss-ratio multicast. This can happen in the different layers that perform memory buffer allocation. To deliver a message to processes (CPU), a reliable receiving mechanism is required. In the standard TCP/IP architecture, reliable delivery is guaranteed by the layer 4 (L4) protocol TCP (Transmission Control Protocol). Despite its ability to ensure lossless delivery for unicast traffic, TCP cannot be used as an L4 protocol for multicast because, as a connection-based protocol, TCP has no mechanism to handle one-to-many connections. On the other hand, with UDP, multicast over IP (L3) is practical, but delivery reliability is never guaranteed. Furthermore, due to the standard protocol stack implementation on the Linux platform, the kernel allocates a socket buffer for each Ethernet frame received and afterwards copies the payload from kernel space to user-space applications. This can amplify buffer misalignment problems and trigger a high loss rate in the upper layer protocols. When the inventors measured UDP loss over a good one-to-one physical connection, the loss ratio obtained was initially as high as 20%. With careful fine tuning of the kernel buffer and traffic load the loss ratio can be improved, but it often remains above one percent.
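
For illustration only, one such kernel-side tuning is enlarging the per-socket receive buffer; the buffer size and port number below are assumed for the example and are not the values used by the inventors.

    import socket

    # Enlarging the per-socket receive buffer is one of the kernel tunings
    # referred to above; the 8 MB figure is illustrative. The kernel may
    # adjust the requested size and caps it at net.core.rmem_max.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
    sock.bind(("0.0.0.0", 5000))  # hypothetical UDP listening port

    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f"effective receive buffer: {effective} bytes")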

Ideally, a message-based L4 protocol with pre-allocated buffers for receiving messages, working in tandem with a lossless ODBSS architecture in L1, would be appropriate for a low-loss multicast system. Based on this understanding, the inventors explored RDMA (Remote Direct Memory Access), a protocol developed for high performance computing. In the RDMA specifications, two datagram-based queue pair types, namely Reliable Datagram (RD) and Unreliable Datagram (UD), could potentially be used for multicast. However, among all the known RDMA implementations today, none supports Reliable Datagram and some do not support multicast at all. This is not surprising and is likely due to the lack of a powerful switch that can support low loss-ratio multicast.
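
The pre-allocation discipline can be sketched conceptually as follows; this is a toy Python model of a pre-posted receive queue, not an RDMA API.

    from collections import deque

    MTU = 4096  # RDMA UD messages are limited by the path MTU

    class PrePostedReceiveQueue:
        """Toy model of an RDMA-style receive queue: buffers are allocated
        and posted *before* traffic arrives, so an incoming message never
        triggers an allocation on the hot path (one cause of loss in the
        UDP/kernel path described above)."""

        def __init__(self, depth: int):
            self.free = deque(bytearray(MTU) for _ in range(depth))

        def on_message(self, payload: bytes):
            if not self.free:
                # Queue exhausted: the over-subscribed receiver drops;
                # nobody else is affected.
                return None
            buf = self.free.popleft()
            buf[: len(payload)] = payload
            return buf

        def repost(self, buf: bytearray):
            self.free.append(buf)  # application returns the buffer when done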

InfiniBand, RDMA over Converged Ethernet (RoCE) and Internet Wide Area RDMA Protocol (iWARP) are the three major implementations of RDMA commonly used in industry. Among them, the best-known implementation is InfiniBand. RoCE, which leverages the low-cost and ubiquitous IP/Ethernet ecosystem, is now being deployed in datacenters.

The inventors employ the RDMA Unreliable Datagram (UD) transport, which has pre-allocated resources on both the sender and receiver sides. In their proof-of-concept work, the inventors experimented with RoCE hardware-based Network Interface Cards (NICs) from different vendors. Using these, they were able to achieve a multicast loss ratio of the order of one per million in the laboratory, which was much better than what is possible with UDP. However, without access to the internal hardware/firmware, the inventors were not able to determine if this could be further improved. Therefore, the inventors turned to Soft-RoCE, an open-source software implementation of RoCE. With some debugging and improvement of the software, they were able to get the multicast datagram feature to work successfully; in doing so, the inventors succeeded in sending over 68 billion multicast packets through the prototype PDXN fabric without any packet loss.

Using a Perftest package, the inventors performed message latency benchmarking tests using two different RoCE hardware NICs (Mellanox and Emulex), comparing the hardware RoCE performance with the inventors' own Soft-RoCE, hereinafter referred to as Viscore-improved Soft-RoCE, as well as the open-source Soft-RoCE. The inventors carried out latency testing using both RDMA Datagram and RDMA RC (Reliable Connection). Since the RDMA Datagram size is limited by the MTU (which is 4096 bytes), the inventors used RDMA RC to extend the testing to larger messages. The results of the Viscore-improved Soft-RoCE together with the open-source Soft-RoCE and hardware RoCE are presented in FIG. 14 for data packets from 2 bytes to 8,388,608 bytes. The inventors found that their implementation achieved better performance than open-source Soft-RoCE, improving the latency and throughput of Soft-RoCE by 2× and 3×, respectively.

2.1.C Scaling the Multicast in Multiple Dimensions

For larger port counts, one can leverage a multi-dimensional approach, as shown in FIG. 15 or as depicted in FIGS. 8 and 9, to scale the network to N^D ports, in which D is the number of dimensions and N is the number of nodes within a dimension. When data packets move from one dimension to another, they go through an Optical-to-Electrical-to-Optical (OEO) conversion. This enables optical wavelengths to be re-used in different dimensions, facilitating the ability to scale. For example, a three-dimensional system based on 40 wavelengths can support up to 40×40×40=64K ports. Similarly, an 80-port ODBSS can potentially scale up to 512K ports. Within FIG. 15 a series of “horizontal” optical cross-connections 1510 are coupled to a series of first nodes 1520. A plurality of second nodes 1530 are connected to a “vertical” optical cross-connection 1540. A subset of the nodes, third nodes 1550, are connected to each of a “horizontal” optical cross-connection 1510 and a “vertical” optical cross-connection 1540. Within the architecture depicted in FIG. 15, each node, whether first node 1520, second node 1530 or third node 1550, is only connected to the optical cross-connection(s) of the dimension(s) to which it belongs.
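
The port-count arithmetic quoted above can be checked directly (a trivial sketch; the function name is ours).

    def total_ports(n: int, d: int) -> int:
        """Port count of a D-dimensional fabric with N nodes per dimension (N^D)."""
        return n ** d

    assert total_ports(40, 3) == 64_000   # the 40-wavelength, 3-D example
    assert total_ports(80, 3) == 512_000  # the 80-port ODBSS example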

It should be noted that, in the multi-dimension scaling method, the nodes between dimensions filter the multicast packets to their sub-nodes. If over-subscription happens, then these nodes will be exposed to the risk of a higher packet loss ratio. Therefore, when designing upper layer protocols, one should bear this in mind and carefully control the over-subscription policy.

Nevertheless, since the ODBSS works in a distributed manner, any over-subscription only affects the end-nodes, not the fabric in between, thus limiting the loss risk to within a subnet or the end-nodes alone. This is in contrast to a centralized switch-based architecture, in which there is a well-known risk of broadcast storms that affect the entire network [11].

2.2 Low Latency and Low Loss Implementation

2.2.A Implementation and Proof-of-Concept Test-Bed Setup

The inventors built a proof-of-concept test-bed using four computer nodes connected together by a 12-port PDXN module. Standard commercial DWDM 10 Gb/s SFP+ transceivers and optical de-multiplexers were used to complete an ODBSS implementation for the four nodes. With this setup, the inventors then tested RDMA UD multicast over IP/Ethernet multicast addressing with several RoCE hardware implementations and software RoCE implementations.

The inventors note that this experimental setup provided several unique advantages when it comes to pushing the loss ratio as low as possible. First of all, if one has already reached a loss ratio lower than one in a million using a setup involving an electronic switch, it would be hard to determine whether the loss was happening in the switch or in the NIC itself. With the inventors' ODBSS architecture, they are confident that if a packet is lost, it could only happen in the transmitting or receiving ports, or the buffers aligned with them. Since there is more than one receiving port, if the transmitting side loses the packet, all receiving sides should lose that packet. This rather simple feature is of great help in debugging and identifying the root cause of packet loss.
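
This loss-localization reasoning can be expressed as a short sketch (all identifiers are illustrative): a packet missing at every receiver was lost on the transmit side, while a packet missing at only some receivers was lost at those receivers or their buffers.

    def localize_loss(sent: set, received_per_port: dict):
        """Exploit the broadcast property to attribute each lost packet to
        the sender side or to particular receivers."""
        verdicts = {}
        for pkt in sorted(sent):
            missing = [p for p, got in received_per_port.items() if pkt not in got]
            if len(missing) == len(received_per_port):
                verdicts[pkt] = "lost at sender side"
            elif missing:
                verdicts[pkt] = f"lost at receiver(s) {missing}"
        return verdicts

    print(localize_loss({1, 2, 3}, {"rx_a": {1, 3}, "rx_b": {1}}))
    # {2: 'lost at sender side', 3: "lost at receiver(s) ['rx_b']"}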

Second, using a software RoCE implementation actually enabled the inventors to debug more effectively for several reasons:

-   the implementation is more transparent to us as we have access to the source code;
-   packets and messages can be tagged as needed for de-bugging purposes; and
-   we can easily fix bugs when we identify them.

The inventors started testing with hardware RoCE implementations, but when they encountered packet loss, they could not make further progress until they switched to a software implementation. The packet loss observed with the hardware RoCE NICs does not necessarily imply that there are bugs in the hardware implementation itself, but rather that the inventors could not pursue its root cause given the proprietary nature of the hardware implementation. The proof-of-concept test bed is depicted in FIG. 16 wherein the 12-port PDXN 1610 is identified, as are the optical DMUXs 1620.

After the inventors pushed the loss ratio to less than one in a hundred million, 1 in 10⁸, some unexpected bugs started to show up that could only be identified and fixed in the test-bed described above. For instance, after such a large number of packets are continuously sent out, the packet sequence number (PSN) exceeds its range and needs to be reset. Although this procedure is well defined and documented, it turned out that the related algorithm in the Soft-RoCE C code did not cover this rare case, which does not happen unless a very large number of UD packets is sent. It is unknown whether hardware implementations cover such rare cases with very large numbers of UD packets.
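
For illustration, and assuming the 24-bit PSN field of the InfiniBand/RoCE base transport header, wrap-around-safe sequence handling of the kind required here may be sketched as follows; this is a sketch, not the actual Soft-RoCE fix.

    PSN_BITS = 24                 # PSN field width in the InfiniBand/RoCE BTH
    PSN_MOD = 1 << PSN_BITS

    def next_psn(psn: int) -> int:
        """Advance the packet sequence number with wrap-around, the rare
        path found unhandled in the Soft-RoCE code."""
        return (psn + 1) % PSN_MOD

    def psn_newer(a: int, b: int) -> bool:
        """Serial-number-style comparison that stays correct across the
        wrap: 'a' is newer than 'b' if it is less than half the space
        ahead of it."""
        return 0 < ((a - b) % PSN_MOD) < (PSN_MOD // 2)

    assert next_psn(PSN_MOD - 1) == 0     # 0xFFFFFF wraps to 0
    assert psn_newer(0, PSN_MOD - 1)      # post-wrap PSN counts as newer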

Last but not least, the practical know-how of building passive optical cross-connects with inexpensive optical components made this implementation economically feasible. It is also evident that the interdisciplinary nature of the work led to the improvements in the low-loss performance of RoCE, where the optical hardware played a key role; this in turn leads to achieving the multicast potential of this optical hardware.

2.2.B Low Latency and Low Loss Ratio

It is instructive to do a quick comparison of the achievable latency performance of ODBSS+RDMA multicast versus that of overlay multicast and other hardware (i.e. switch-based) multicast. A good example of a high-performance overlay multicast is one based on a binomial tree implementation; a classic binomial multicast tree is depicted in FIG. 17.

The overlay binomial multicast latency can be thought of as being given by Equation (1) below, where L is the unicast latency, N is the node count, and K is a weighting factor which depends on how long each node has to wait to complete its task (and can therefore increase nonlinearly with N).

Latency=K·log₂(N)·L  (1)

At first glance, the latency of binomial overlay multicast does not grow that fast with the node count because the binomial algorithm builds chains of length log₂(N). However, measurements show that the latency of binomial multicast actually grows nonlinearly with node count. This is due to two factors in the overlay implementation. The first is that the long tail of the unicast latency is much larger (35 μs versus 3 μs) than the average latency. The second is that nodes on the chain must wait for the previous node to send them a packet before they can send. Therefore, the latency of chain(s) in the binomial tree is vulnerable to the statistical nature of traffic in a network. These statistical fluctuations only worsen with the extra traffic burden introduced by the binomial algorithm.
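
Equation (1) can be evaluated directly; in the sketch below the unicast latency and the growth of K with N are assumed values chosen only to illustrate the nonlinear trend described above, not measured figures.

    import math

    def binomial_overlay_latency(n: int, unicast_latency_us: float, k: float) -> float:
        """Equation (1): Latency = K * log2(N) * L."""
        return k * math.log2(n) * unicast_latency_us

    # Illustrative numbers only: L = 3 us average unicast latency, and a
    # weighting factor K that itself grows with N to mimic the measured
    # nonlinear behaviour (long-tail and chain-waiting effects).
    for n in (8, 64, 512):
        k = 1.0 + 0.1 * math.log2(n)   # assumed, for illustration
        print(n, round(binomial_overlay_latency(n, 3.0, k), 1), "us")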

Hardware (i.e. switch-based) multicast, e.g. IP multicast or InfiniBand multicast, should in principle have better latency than overlay multicast. For example, the latency of hardware-multicast-based algorithms has been shown in the prior art to out-perform that of binomial overlay multicast. However, InfiniBand multicast (as well as IP multicast) is lossy, which limits its potential use.

Unlike InfiniBand hardware multicast, the loss ratio of RDMA multicast over ODBSS is very low. In the inventors' test-bed demonstration the loss ratio has been pushed to as low as one in 68 billion packets. With ODBSS, staying within one dimension, the multicast latency is comparable to the unicast latency. When scaling using multiple dimensions, the increase in multicast latency is weighted by the number of dimensions, rather than by N (the number of nodes). As N increases, the multicast latency advantage grows nonlinearly when compared to overlay multicast latency.

It is worthwhile to note that incast and over-subscription management are always a challenge for all multicast architectures. However, the proposed ODBSS architecture has advantages for incast traffic because the selection happens at the end point. Even if one node is over-subscribed, it only affects that one particular node; neither the ODBSS fabric, the sender, nor the other receiving nodes are impacted.

2.2.C Enabling Low Latency Reliable Multicast

The low-latency, low loss-ratio optical multicast described has the potential to become an important toolset for protocol designers who need a low-latency reliable multicast to implement consistency protocols. Given the very low loss ratio observed by the inventors for optical multicast, they believe it is practical to build a simple NACK-based reliable multicast transport over ODBSS and RDMA Datagram.
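
A minimal sketch of such a NACK-based receiver, assuming per-message sequence numbers (a toy model, not the inventors' protocol), is given below; because gaps are rare at these loss ratios, the recovery traffic stays negligible.

    class NackReceiver:
        """Toy receiver for a NACK-based reliable multicast: deliver
        in-order packets, and on a sequence gap record a NACK for the
        missing range (actually sending it is elided here)."""

        def __init__(self):
            self.expected_seq = 0
            self.nacks_sent = []

        def on_packet(self, seq: int, payload: bytes):
            if seq == self.expected_seq:
                self.expected_seq += 1
                return payload                 # in-order delivery
            if seq > self.expected_seq:
                # gap detected: request repair of only what is missing
                missing = list(range(self.expected_seq, seq))
                self.nacks_sent.append(missing)
                self.expected_seq = seq + 1
                return payload
            return None                        # duplicate/stale, drop

    rx = NackReceiver()
    rx.on_packet(0, b"a")
    rx.on_packet(2, b"c")
    assert rx.nacks_sent == [[1]]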

As an example, Byzantine fault tolerance consistency protocols are built using reliable multicast, so it is conceivable that such protocols could potentially benefit from an intrinsically low-latency reliable multicast. A low-latency consistency protocol could shorten the time window available for traitorous processes to attack by enabling a distributed system to achieve consistency faster. Furthermore, traitorous processes would have their own consistency challenge if they need to collaborate among themselves using a secret communication channel, especially if their channel lacks this low-latency advantage.

2.3 Comments

The architectures presented by the inventors provide a scalable, low-latency, low loss-ratio transport-layer multicast solution by combining the benefits of an optical cross-connect fabric (ODBSS) with RDMA. This combination in turn simplifies low-latency reliable multicast implementation.

In comparing their implementation with the prior art, the inventors have identified instances of employing optical couplers to build optical switch fabrics or to demonstrate multicasting. Within the prior art, Ni et al. in “PDXN: A New Passive Optical Cross-Connection Network for Low Cost Power-Efficient Datacenters” (J. Lightwave Technology, 32(8), pp. 1482-1500) employed optical couplers, such as 1×N and N×N couplers, to build an optical switch fabric through a TDMA implementation. In contrast, Samadi et al. in “Optical Multicast System for Data Center Networks” (Optics Express, 23(17), pp. 22162-22180) integrated 1×N passive optical splitters within a hybrid network architecture combining optical circuit switching with electronic packet switching to reduce the complexity of multicast traffic flows.

Further, in Samadi et al., “Experimental Demonstration of One-to-Many Virtual Machine Migration by Reliable Optical Multicast” (25th European Conference on Optical Communication (ECOC); DOI: 10.1109/ECOC.2015.7342006), an optical circuit switching network directs multicast traffic to a 1×N optical splitter whilst a separate electronic packet switching network is employed for NACK control.

It would be evident that, in contrast to the prior art, no electronic packet switching network is required as it is by Samadi et al. Similarly, Ni et al. is silent regarding wavelength division multiplexing and an ODBSS architecture. Further, the very low loss ratio achievable by the architecture proposed by the inventors allows simplified NACK control and reduced latency.

Embodiments of the invention as described above exploit a new optical architecture in conjunction with RDMA to offer an intrinsically low-latency and low loss-ratio multicast channel. Building upon this, a reliable multicast protocol is proposed to deliver a reliable, low-latency, and scalable multicast service to distributed systems. By offloading multicast traffic, this reliable low-latency multicast service also improves the unicast performance of existing switch fabrics. Within a subnet, this optical hardware offers intrinsic ordering in the physical layer. Also, RDMA maintains ordering within a message.

The inventors also note that these embodiments of the invention, through their low-latency reliable multicast, can be employed in other applications such as fast data replication services, including publish/subscribe (Pub/Sub) services and distributed lock services, especially in use cases with fast Non-Volatile Memory Express over Fabric (NVMeOF) storage. Additionally, as mentioned above, Reliable Datagram (RD) is currently not supported by the RDMA implementations tested, primarily because of the N^N to N! issue alluded to earlier, which makes it extremely hard to perform non-blocking broadcast in modern electrical packet switching systems. However, the proposed ODBSS overcomes this obstacle, allowing a Reliable Datagram to be implemented over the ODBSS architecture.
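
The gap between the N! permutations realizable by a non-blocking unicast switch and the N^N configurations demanded by arbitrary multicast (every output free to select any input) can be illustrated numerically (a trivial sketch).

    import math

    # N! grows far more slowly than N^N, which is the obstacle alluded to
    # above for non-blocking broadcast in electrical packet switches.
    for n in (4, 8, 16):
        print(n, math.factorial(n), n ** n)
    # 4 -> 24 vs 256; 8 -> 40,320 vs 16,777,216; 16 -> ~2.1e13 vs ~1.8e19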

Specific details are given in the above description to provide a thorough understanding of the embodiments. However, it is understood that the embodiments may be practiced without these specific details. For example, circuits may be shown in block diagrams in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.

Implementation of the techniques, blocks, steps, and means described above may be done in various ways. For example, these techniques, blocks, steps, and means may be implemented in hardware, software, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described above and/or a combination thereof.

Also, it is noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

Furthermore, embodiments may be implemented by hardware, software, scripting languages, firmware, middleware, microcode, hardware description languages and/or any combination thereof. When implemented in software, firmware, middleware, scripting language and/or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium, such as a storage medium. A code segment or machine-executable instruction may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a script, a class, or any combination of instructions, data structures and/or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters and/or memory content. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.

For a firmware and/or software implementation, the methodologies may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. Any machine-readable medium tangibly embodying instructions may be used in implementing the methodologies described herein. For example, software codes may be stored in a memory. Memory may be implemented within the processor or external to the processor and may vary in implementation where the memory is employed in storing software codes for subsequent execution to that when the memory is employed in executing the software codes. As used herein the term “memory” refers to any type of long term, short term, volatile, nonvolatile, or other storage medium and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.

Moreover, as disclosed herein, the term “storage medium” may represent one or more devices for storing data, including read only memory (ROM), random access memory (RAM), magnetic RAM, core memory, magnetic disk storage mediums, optical storage mediums, flash memory devices and/or other machine readable mediums for storing information. The term “machine-readable medium” includes, but is not limited to portable or fixed storage devices, optical storage devices, wireless channels, and/or various other mediums capable of storing, containing, or carrying instruction(s) and/or data.

The methodologies described herein are, in one or more embodiments, performable by a machine which includes one or more processors that accept code segments containing instructions. For any of the methods described herein, when the instructions are executed by the machine, the machine performs the method. Any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine are included. Thus, a typical machine may be exemplified by a typical processing system that includes one or more processors. Each processor may include one or more of a CPU, a graphics-processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including main RAM and/or a static RAM, and/or ROM. A bus subsystem may be included for communicating between the components. If the processing system requires a display, such a display may be included, e.g., a liquid crystal display (LCD). If manual data entry is required, the processing system also includes an input device such as one or more of an alphanumeric input unit such as a keyboard, a pointing control device such as a mouse, and so forth.

The memory includes machine-readable code segments (e.g. software or software code) including instructions for performing, when executed by the processing system, one or more of the methods described herein. The software may reside entirely in the memory, or may also reside, completely or at least partially, within the RAM and/or within the processor during execution thereof by the computer system. Thus, the memory and the processor also constitute a system comprising machine-readable code.

In alternative embodiments, the machine operates as a standalone device or may be connected, e.g. networked, to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer or distributed network environment. The machine may be, for example, a computer, a server, a cluster of servers, a cluster of computers, a web appliance, a distributed computing environment, a cloud computing environment, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. The term “machine” may also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The foregoing disclosure of the exemplary embodiments of the present invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many variations and modifications of the embodiments described herein will be apparent to one of ordinary skill in the art in light of the above disclosure. The scope of the invention is to be defined only by the claims appended hereto, and by their equivalents.

Further, in describing representative embodiments of the present invention, the specification may have presented the method and/or process of the present invention as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art would appreciate, other sequences of steps may be possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. In addition, the claims directed to the method and/or process of the present invention should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the present invention.

What is claimed is:
1. A method of multicasting comprising: providing a passive optical cross-connect fabric; providing a set of first nodes, each first node connected to an input port of the passive optical cross-connect fabric and transmitting on a predetermined wavelength of a set of wavelengths; providing a set of second nodes, each second node connected to an output port of the passive optical cross-connect fabric; and transmitting data from a predetermined subset of the set of first nodes to a predetermined subset of the set of second nodes using a direct memory access protocol; wherein all messages broadcast by each first node of the set of first nodes are broadcast to all second nodes of the set of second nodes.
2. The method according to claim 1, wherein the direct memory access protocol is a remote direct memory access protocol.
3. The method according to claim 1, wherein the direct memory access protocol is a layer-4 transport protocol based upon a remote direct memory access datagram.
4. The method according to claim 1, wherein a message loss ratio of messages transmitted is better than one or more of one in a billion, one in ten billion, one in fifty billion, and one in sixty billion.
5. The method according to claim 1, wherein an end-to-end latency is less than 8 μs with 10 Gb/s Ethernet network adapters.
6. The method according to claim 1, wherein an end-to-end latency is less than 2 μs with 10 Gb/s Ethernet network adapters.
7. The method according to claim 1, wherein the passive optical cross-connect fabric does not include any optical switching elements.
8. The method according to claim 1, wherein the passive optical cross-connect fabric is an optically distributed broadcast and select switch; wherein selection by a second node of the set of second nodes of which first node of the set of first nodes to receive a message from is determined by the second node of the set of second nodes.
9. The method according to claim 1, wherein the passive optical cross-connect fabric with the set of first nodes and set of second nodes supports negative-acknowledgement based layer 4 multicasting.
10. The method according to claim 1, wherein each second node comprises a selector which receives as inputs the messages broadcast by the plurality of nodes and selects for its output those messages broadcast by a first node of a plurality of first nodes.
11. The method according to claim 1, wherein each second node comprises a selector for dynamically selecting messages from received messages broadcast by the plurality of nodes; and each first node is connected to the plurality of second nodes via an optical bus and a plurality of taps where each tap of the plurality of taps is associated with a second node of the plurality of second nodes.
12. A method of multicasting comprising: providing a passive optical cross-connect fabric comprising a plurality of first optical cross-connections and a plurality of second optical cross-connections; providing a set of first nodes each transmitting on a predetermined wavelength of a set of wavelengths; providing a set of second nodes each transmitting on a predetermined wavelength of a set of wavelengths; providing a plurality of third nodes each transmitting on a predetermined wavelength of a set of wavelengths; wherein predetermined subsets of the set of first nodes are connected to predetermined first optical cross-connections of the plurality of first optical cross-connections; predetermined subsets of the set of second nodes are connected to predetermined second optical cross-connections of the plurality of second optical cross-connections; and each third node of the plurality of third nodes is connected to another predetermined first optical cross-connection of the plurality of first optical cross-connections and to another predetermined second optical cross-connection of the plurality of second optical cross-connections.
13. A method of multicasting comprising: providing a passive optical cross-connect fabric having a plurality D dimensions, each dimension comprising a plurality of optical cross-connections; providing a set of first nodes each transmitting on a predetermined wavelength of a set of N wavelengths; providing a set of second nodes each transmitting on a predetermined wavelength of a set of D wavelengths; wherein predetermined subsets of the set of first nodes are connected to a first predetermined optical cross-connection of the plurality of optical cross-connections in a predetermined dimension of the plurality D dimensions; each second node of the plurality of second nodes is connected to D second predetermined optical cross-connections of the plurality of optical cross-connections where each second predetermined optical cross-connection of the D second predetermined optical cross-connections of the plurality of optical cross-connections is within a different dimension of the D dimensions of the passive optical cross-connect fabric such that each second node of the plurality of second nodes is connected to all D dimensions; and N and D are positive integers.
14. The method according to claim 1, wherein the plurality of first nodes comprises M first nodes; and M=N^D.